Multiple-processor systems can provide higher performance and higher
reliability/availability than single-processor systems. In order to properly
assess the effectiveness of multi-processor systems, measures that combine
performance and reliability are needed. We describe the behavior of the
multi-processor system as a continuous-time Markov chain and associate
a reward rate (performance measure) with each state. We evaluate the
distribution of performability for analytical models of a multi-processor
system using a recently improved polynomial-time algorithm that obtains
the distribution of performability for non-repairable as well as repairable
systems with heterogeneous components with a substantial speedup over
earlier work. The system that we analyze with several Markov reward
models is the (C.mmp) multi-processor system developed at Carnegie
Mellon University. The example indicates that distributions of cumulative
performance measures over finite intervals reveal behavior of
multi-processor systems not indicated by either steady-state or mean values
alone.
1  Introduction

The proliferation of fault-tolerant multiple processor systems has given
rise to the need to develop composite reliability and performance measures.
For this purpose, Meyer [20] developed a conceptual framework of
performability. In this paper, we consider performability models based on Markov
Reward Models (MRMs). We obtain a variety of performability measures
on several models of a multi-processor system to illustrate the effect of
different fault-tolerant mechanisms on the ability of the system to complete
useful work in a finite time interval. In the course of this study, we show
that the distribution of accumulated reward illuminates effects that are not
detected by steady-state values, instantaneous measures, or expected values
of cumulative measures. Hence, the performability distribution provides new
insight into the behavior of multi-processor computer systems. We describe a
new $O(n^3)$ algorithm for the computation of the distribution of accumulated
reward in a finite utilization interval, where $n$ is the number of states in the
MRM.

The evolution of the system through configurations with different sets of
operational components is represented by a continuous-time Markov chain
(CTMC), which we refer to as a structure-state process. The set of rewards
associated with the states of a structure-state process is referred to as the
reward structure. Together the structure-state process and the reward
structure determine a Markov Reward Model (MRM). Because the time-scale of
the performance-related events (e.g., instruction execution, job service) is at
least two orders of magnitude less than the time-scale of the
reliability-related events (e.g., component failure, component repair), steady-state
values of performance models are used to specify the performance levels or
reward rates for each structure state.

We analyze several MRMs of a multi-processor system with 16 processors,
16 memories and a crossbar switch. In Appendix A we describe an
improved algorithm to obtain the performability distributions from MRMs
with $n$ structure-states that provides an $O(n)$ speedup over the earlier
algorithm in [19]. The algorithm may be applied to MRMs constructed for
repairable or non-repairable systems. We demonstrate the use of our
algorithm on a problem of moderate size. Previously published results on
performability distributions for finite time intervals have been carried out
only on very small problems. With the multi-processor system, we examine
the effect of different modeling assumptions on a number of measures
including the distribution of accumulated reward.

The freedom to modify the structure-state process as well as the reward
structure allows the modeler to represent a wide variety of situations. In
the performability domain, there are two extremes. At one extreme, we may have
a structure-state process with only a single state and a possibly complex
performance model to generate the reward associated with the single state.
A 'pure' performance model that ignores failure and repair but considers
memory contention overestimates the ability of the system to complete useful
work. At the other extreme, a 'pure' availability model ignores different
levels of performance (other than operational or failed). A model that takes
into account both aspects of system behavior by a combined performability
measure is more appropriate for the evaluation of computer systems that
may undergo a graceful degradation of performance. After completing the
introduction, we describe the multi-processor system in section 2. In section
3, we present results for MRMs of the multi-processor system. In Appendix
A we describe and analyze the computational cost of the algorithm used
to determine the distribution of accumulated reward for cyclic or acyclic
MRMs.

1.1  Notation 

The evolution of the system in time is represented by the finite-state
stochastic process $\{Z(t),\ t \ge 0\}$, which characterizes the dynamics of the
system structure and environmental influences. $Z(t) \in S = \{1, 2, \ldots, n\}$
is the structure-state of the system at time $t$. The holding times in the
structure-states are exponentially distributed, and hence $Z(t)$ is a
homogeneous CTMC. Even in situations where the holding times are generally
distributed, they may often be acceptably approximated using a finite
number of exponential phases [14]. We let $q_{ij}$ be the transition rate from state
$i$ to state $j$ and $Q = [q_{ij}]$ be the $n$ by $n$ generator matrix where

$$ q_{ii} = -\sum_{j=1,\ j \ne i}^{n} q_{ij} . $$

Let $p_i(t)$ denote $\mathrm{Prob}[Z(t) = i]$, the probability that the system is in
state $i$ at time $t$. The column vector $\underline{p}(t)$ of the state probabilities may be
computed by solving a matrix differential equation [23]:

$$ \frac{d}{dt}\,\underline{p}(t) = Q^T \underline{p}(t) . \qquad (1) $$
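
As an illustration of equation (1), the following minimal sketch computes the transient state probabilities of a small CTMC by uniformization (randomization). It is not the method used in this paper; the function name `transient_probs`, the row-list representation of the generator, and the two-state example are assumptions made only for illustration, and the plain Poisson sum is suitable only when the product of the largest exit rate and t is moderate.

```python
import math

def transient_probs(Q, p0, t, tol=1e-12):
    """Approximate p(t) for a CTMC with generator Q (list of rows) by uniformization.

    Hypothetical helper for illustration; exp(-rate * t) underflows for very
    large rate * t, so this simple version is meant for small examples only.
    """
    n = len(Q)
    rate = max(-Q[i][i] for i in range(n))  # uniformization rate >= every exit rate
    if rate == 0.0:
        return list(p0)  # no transitions at all
    # One-step matrix of the uniformized discrete-time chain: P = I + Q / rate.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]
    v = list(p0)                  # v holds p(0) P^k, starting with k = 0
    weight = math.exp(-rate * t)  # Poisson(rate * t) probability of k jumps
    result = [weight * x for x in v]
    total, k = weight, 0
    while 1.0 - total > tol:      # stop once the Poisson mass is (almost) used up
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k
        total += weight
        result = [r + weight * x for r, x in zip(result, v)]
    return result

# Two-state example (assumed rates): state 0 fails at rate 1, state 1 is repaired at rate 2.
p = transient_probs([[-1.0, 1.0], [2.0, -2.0]], [1.0, 0.0], 1.0)
```

For this two-state chain the result can be checked against the closed form $p_0(t) = 2/3 + (1/3)e^{-3t}$.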

The steady-state probability vector $\underline{\pi}$ of the Markov chain is the solution
of the linear system:

$$ Q^T \underline{\pi} = 0, \qquad \sum_{i} \pi_i = 1 . $$
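
The steady-state linear system can be sketched for a small chain with plain Gauss-Jordan elimination. This is only an illustration under the assumption that the chain is irreducible (so the solution is unique); the function name `steady_state` and the example rates are hypothetical, and a production solver for large models would use a sparse or iterative method instead.

```python
def steady_state(Q):
    """Solve Q^T pi = 0 with sum(pi) = 1 for an irreducible CTMC (illustrative sketch)."""
    n = len(Q)
    # Augmented rows of Q^T pi = 0; the last balance equation is redundant,
    # so it is replaced by the normalization condition sum(pi) = 1.
    A = [[Q[j][i] for j in range(n)] + [0.0] for i in range(n)]
    A[n - 1] = [1.0] * n + [1.0]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

# Two-state example (assumed rates): failure rate 1, repair rate 2.
pi = steady_state([[-1.0, 1.0], [2.0, -2.0]])
```

For the two-state example the result is $\pi = (2/3, 1/3)$, the usual repair-rate over total-rate availability.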

Let $r_i$ be the reward rate (or the performance level) associated with
structure-state $i$; then the vector $\underline{r}$ defines the reward structure. The reward
rate of the system at time $t$ is defined to be $X(t) = r_{Z(t)}$. We let $Y(t)$ be
the accumulated reward until time $t$, that is, the area under the $X(t)$ curve,

$$ Y(t) = \int_0^t X(\tau)\, d\tau . $$
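
Because $X(t)$ is piecewise constant along any sample path, the integral defining $Y(t)$ reduces to a sum of reward rate times sojourn time. The sketch below shows this for one hand-written path; the function name `accumulated_reward`, the path representation, and the reward values are assumptions for illustration only.

```python
def accumulated_reward(path, rewards, t):
    """Y(t) for one sample path given as (state, holding_time) pairs in visit order.

    Illustrative helper: if the path is shorter than t, only the covered part
    of the interval contributes.
    """
    y, clock = 0.0, 0.0
    for state, hold in path:
        stay = min(hold, t - clock)  # truncate the last sojourn at time t
        if stay <= 0.0:
            break
        y += rewards[state] * stay   # area of one rectangle under X(t)
        clock += stay
    return y

# Hypothetical 3-state reward structure and sample path.
rewards = {1: 2.0, 2: 1.0, 3: 0.0}
path = [(1, 3.0), (2, 2.0), (3, 1.0), (1, 10.0)]
y = accumulated_reward(path, rewards, 8.0)  # 2*3 + 1*2 + 0*1 + 2*2
w = y / 8.0                                 # time-averaged accumulated reward
```

The distribution of accumulated reward studied in this paper is the distribution of this quantity over all sample paths of the CTMC.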


Consequently, by interpreting rewards as performance levels, we see that the
distribution of accumulated reward is at the heart of characterizing systems
that evolve through states with different reward rates (e.g., performance
levels). In Figure 1 we depict a Markov reward model with a 3-state CTMC
... complementary distribution are denoted as:

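$$ W(t) \equiv \frac{Y(t)}{t} = \frac{1}{t}\int_0^t X(\tau)\, d\tau, \quad \mathcal{W}(x,t) \equiv \mathrm{Prob}[W(t) \le x] \quad \text{and} \quad \mathcal{W}^c(x,t) \equiv \mathrm{Prob}[W(t) > x] . $$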

A special case of $W(t)$ is obtained when we assign a reward rate 1 to
operational states and zero to non-operational states. In this case, $W(t)$ is
known as the interval availability $A_I(t)$. The complementary distributions
explicitly answer the questions of the modeler and are easily obtained from
the results for $\mathcal{Y}(x,t)$ in Appendix A. To complete our notation, we note
that we have assumed a distinguished initial state. To explicitly indicate
this dependence on the initial state we will use a subscript on cumulative
and time-averaged random variables and their distributions. For example,
$W_i(t)$ denotes the time-averaged accumulated reward for the interval $(0, t)$
given that the initial state is $i$ (i.e., $Z(0) = i$).

The ability to complete a given amount of work with probability one is
a property of some Markov reward models. An MRM is said to have the
completion property if it does not have a reachable closed set $C$ of states such
that $r_i = 0$ for all $i \in C$. As an example of an MRM with the completion
property, consider Figure 1a with all parameters greater than zero. Since
the probability of remaining in structure state 3 for all but a finite amount
of time in an infinite time interval is zero and structure states 1 and 2 have
non-zero reward, any finite amount of reward will be accumulated if the
time interval is long enough. Because descriptions of fault-tolerant systems
almost always include “failed” (zero reward) structure states, we will refer
to MRMs of fault-tolerant systems that take repair actions from all “failed”
structure states as MRMs with completion. The completion property is a
useful distinction because it indicates the most appropriate measures for a
model. MRMs with the completion property are appropriately described
with $\mathcal{W}(x,t)$, while models without it are readily described with $\mathcal{Y}(x,t)$.

Those Markov models without the completion property will be referred to
as MRMs with imperfect repair. An MRM in which operational states are
assigned reward rate 1 and non-operational states are assigned rate 0 is
called an availability model. In an availability model, if we further require that
all non-operational states are absorbing then we have a reliability model.
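
The completion property can be checked mechanically on a small model. The sketch below relies on one observation: a reachable closed set of zero-reward states exists exactly when some reachable state cannot reach any state with positive reward. The function name `has_completion_property` and the two 3-state generators are hypothetical and are not the parameters of Figure 1.

```python
def has_completion_property(Q, rewards, initial):
    """Return True if no closed set of zero-reward states is reachable from `initial`."""
    n = len(Q)

    def reachable(start):
        # Depth-first search over transitions with positive rate.
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if j != i and Q[i][j] > 0.0 and j not in seen:
                    seen.add(j)
                    stack.append(j)
        return seen

    for i in reachable(initial):
        # If every state reachable from i (including i) earns zero reward,
        # those states form a reachable closed zero-reward set.
        if all(rewards[j] == 0.0 for j in reachable(i)):
            return False
    return True

rewards = [2.0, 1.0, 0.0]  # assumed reward rates; the third state is "failed"
with_repair = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [1.0, 0.0, -1.0]]
no_repair = [[-1.0, 1.0, 0.0], [0.0, -1.0, 1.0], [0.0, 0.0, 0.0]]
print(has_completion_property(with_repair, rewards, 0))  # repair from the failed state
print(has_completion_property(no_repair, rewards, 0))    # failed state is absorbing
```

In the terminology above, the first model is an MRM with completion and the second is an MRM with imperfect repair.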

1.2  Previous  Work 

Early attempts to evaluate fault-tolerant computer systems were restricted
to transient analysis of the CTMC describing the evolution of the system
over time. The immediate result relating the transient probability to the
probability of the system operating at a specified reward level,

$$ \mathrm{Prob}[X(t) = r] = \sum_{\{j \,\mid\, r_j = r\}} \mathrm{Prob}[Z(t) = j] = \sum_{\{j \,\mid\, r_j = r\}} p_j(t) , $$

was exploited by Huslende [15] and Wu [28].

Gracefully degrading systems provide useful computation by reconfiguring
to adjust to the failure of one or more components. Beaudry used
the notion of computation availability, which in our notation is the expected
reward rate at time $t$:

$$ E[X(t)] = \underline{r}^T \underline{p}(t) = \sum_i r_i\, p_i(t) , $$

and its limiting value:

$$ \lim_{t \to \infty} E[X(t)] = \underline{r}^T \underline{\pi} = \sum_i r_i\, \pi_i . $$

These two quantities are generalizations of instantaneous and steady-state
availability, respectively. Huslende considered performance reliability by
assuming a minimum performance threshold:

$$ R(\text{threshold}, t) = \mathrm{Prob}[X(\tau) \ge \text{threshold},\ \forall \tau \le t] , $$

a generalization of reliability.

Under general assumptions about the stochastic process $\{Z(t),\ t \ge 0\}$
and the reward structure $\underline{r}$, Howard [13] studied the expected accumulated
reward $E[Y(t)]$ for finite intervals of time and the expected time-averaged
accumulated reward over an infinite time interval. It is interesting to note
that the limits as $t \to \infty$ of the expected values of $X(t)$ and $W(t)$ are equal:

$$ \lim_{t \to \infty} E[W(t)] = \sum_i r_i\, \pi_i = \lim_{t \to \infty} E[X(t)] . $$

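With our notation we can express $E[Y(t)]$ as:

$$ E[Y(t)] = E\left[\int_0^t X(\tau)\,d\tau\right] = \int_0^t E[X(\tau)]\,d\tau = \sum_i r_i \int_0^t p_i(\tau)\,d\tau . $$

To compute $E[Y(t)]$ we define $L_i(t) = \int_0^t p_i(\tau)\,d\tau$ to be an element of $\underline{L}(t)$
and derive a system of ordinary differential equations for $\underline{L}(t)$ by integrating
equation (1):

$$ \frac{d}{dt}\,\underline{L}(t) = Q^T \underline{L}(t) + \underline{p}(0) . $$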

Solutions are readily calculated using methods similar to those used to solve
equation (1). Often we are interested in the behavior of $Y(t)$ far from the
mean (as is the case when a system is required to deliver a specific reward
with high probability), and in this case the central moments do not provide
accurate information. Consequently, measures that provide a more detailed
look at system behavior are needed.

Recently, considerable attention has been given to the problem of
evaluating the distribution of accumulated reward, $\mathcal{Y}(x,t)$. The problem is more
easily solved if the distribution of accumulated reward is to be evaluated over
an infinite time interval. Beaudry [1] has shown that the distribution of
accumulated reward until system failure ($\mathcal{Y}(x,\infty)$) for a system with imperfect
repair can be obtained as the time-to-failure distribution of an associated
CTMC obtained by simply dividing the rates of transitions leaving a given
state $i$ by $r_i$.
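
The rate-scaling step of Beaudry's transformation can be sketched in a few lines. The version below assumes that every zero-reward state is an absorbing failure state, so its row is left untouched; the function name `beaudry_transform` and the example rates are hypothetical, and the handling of non-absorbing zero-reward states is not covered.

```python
def beaudry_transform(Q, rewards):
    """Divide the transition rates leaving state i by r_i (sketch).

    Assumption: states with zero reward are absorbing, so their rows are copied as-is.
    """
    n = len(Q)
    scaled = []
    for i in range(n):
        if rewards[i] > 0.0:
            scaled.append([Q[i][j] / rewards[i] for j in range(n)])
        else:
            scaled.append(list(Q[i]))
    return scaled

# One operational state with reward rate 2 failing at rate 0.5 into an absorbing state.
T = beaudry_transform([[-0.5, 0.5], [0.0, 0.0]], [2.0, 0.0])
```

In this example the transformed chain leaves the operational state at rate 0.25, so the reward accumulated until failure is exponentially distributed, $\mathrm{Prob}[Y(\infty) \le x] = 1 - e^{-0.25 x}$, which agrees with accumulating reward at rate 2 over an exponential lifetime of rate 0.5.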


For finite time intervals, Meyer [21] obtained the distribution of
accumulated reward in acyclic Markov reward models (no loops in the
structure-state CTMC) with $r_i$ being a monotonic function of the state labeling. A
direct approach that numerically integrated the convolution equations in the
time domain for acyclic models was developed and implemented by
Furchtgott and Meyer [9]. The computational complexity is exponential in the
number of states, so the applicability of the direct time-domain approach
is limited to problems with a few states over a short time interval.
Subsequently, Goyal and Tantawi [10] developed an $O(n^3)$ algorithm to compute
the distribution of accumulated reward in general acyclic structure-state
processes with monotonic reward rates. Ciciani and Grassi [3] and
Donatiello and Iyer [6] proposed algorithms that do not require the rewards to
be monotonic.

MRMs that have cyclic structure-state CTMCs are more difficult. By
using the central limit theorem, it can be shown that the asymptotic
distribution of the accumulated reward over a time interval $(0, t)$ for $t$ sufficiently
large is normal with mean $\lim_{\tau \to \infty} E[X(\tau)]$ multiplied by $t$
and variance $\sigma\sqrt{t}$. Computational methods to determine $\lim_{\tau \to \infty} E[X(\tau)]$
and $\sigma$ may be found in Hordijk et al. [12].

Iyer  et  al.  [16]  describe  a  recursive  technique  for  computing  moments  of 
the  distribution  of  accumulated  reward  for  cyclic  MRMs.  With  the  moments 
in  hand,  bounds  on  the  distribution  of  accumulated  reward  are  available. 
As  noted  earlier,  because  the  central  moments  describe  the  behavior  of  the 
distribution  about  the  mean,  the  bounds  are  often  too  loose  to  be  helpful 
at  the  extremes,  which  are  often  of  interest.  The  difficulties  are  similar  to
those  one  faces  extrapolating  the  value  of  a  continuous  function  a  distance 
away  from  a  point  where  all  the  derivatives  are  known. 

More recently, Goyal, Tantawi, and Trivedi [11] formulated the interval
availability problem (a special instance of $W(t)$, for a reward structure with
reward rates $r_i = 1$ if state $i$ is operational and zero otherwise) as a system of
first order partial differential equations. The randomization technique has
also been applied to the interval availability problem by de Souza e Silva
and Gail [8].

Puri [22] derived a linear system in the double Laplace transform of the
distribution of accumulated reward for a general CTMC and arbitrary
reward structure. The numerical solution of the double transform system was
proposed in [19]. In Appendix A we present an improved $O(n^3)$ algorithm to
evaluate the distribution of accumulated reward for cyclic and acyclic MRMs
with $n$ states. Note that the $O(n)$ speedup over our previous algorithm [19]
makes considerably larger MRMs solvable in practice.

In Table 1 we present the measures that we use to examine the
behavior of the example multi-processor system. We group the measures by the
random variables used in their definition. Each measure's properties are
then indicated. The properties that we indicate are whether the quantity
measured is instantaneous or cumulative, steady state or transient. We also
indicate in Table 1 whether the measure is a distribution or a central
moment. We use a column in Table 1 to indicate the model
families each measure is typically applied to. We use rel, av, Imp-rep, and
compl as abbreviations for the reliability, availability, imperfect repair, and
completion families respectively.


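Measure                  Notation                                   Common    Cumulative or   Steady     Distribution
                                                                    Model     Instantaneous   State or   or Moment
                                                                    Family    Measure         Transient
-----------------------  -----------------------------------------  --------  --------------  ---------  ------------
$p_i(t)$                 (illegible)                                av        Z(t) : I        T          (illegible)
$\pi_i$                  (illegible)                                av        Z(t) : I        S          (illegible)
$A(t)$                   $\sum_{i \in UP} p_i(t)$                   av        Z(t) : I        T          M
$A(\infty)$              $\lim_{t \to \infty} A(t)$                 av        Z(t) : I        S          M
Reliability              $P[X(\tau) \ge 1, \forall \tau \le t]$     rel       (illegible)     T          cdf
$E[X(t)]$                $E[X(t)]$                                  all       X(t) : I        T          M
$E[Y(t)]$                $E[Y(t)]$                                  Imp-rep   Y(t) : C        T          M
$\mathcal{Y}(x,t)$       $\mathcal{Y}(x,t)$                         Imp-rep   Y(t) : C        T          cdf
$\mathcal{Y}(x,\infty)$  $\lim_{t \to \infty} \mathcal{Y}(x,t)$     Imp-rep   Y(t) : C        S          cdf
$P[A_I(t) \le x]$        $P[W(t) \le x]$                            av        W(t) : C        T          cdf
$E[W(t)]$                $E[W(t)]$                                  compl     W(t) : C        T          M
$E[W(\infty)]$           $E[W(\infty)]$                             compl     W(t) : C        S          M
$\mathcal{W}(x,t)$       $\mathcal{W}(x,t)$                         compl     W(t) : C        T          cdf
$\mathcal{W}(x,\infty)$  $\lim_{t \to \infty} \mathcal{W}(x,t)$     compl     W(t) : C        S          cdf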

Table  1.  Measures  and  Their  Characteristics 

Measures used to characterize the behavior of Markov reward models of
the multi-processor system with imperfect repair (without the completion
property) are the reliability, $R(t)$, the distribution of accumulated reward
(performability) over a finite interval, $\mathcal{Y}(x,t)$, and
$\mathcal{Y}(x,\infty) \equiv \lim_{t \to \infty} \mathcal{Y}(x,t)$.
On models with the completion property we use $\mathcal{W}(x,t)$ and
$\mathcal{W}(x,\infty) \equiv \lim_{t \to \infty} \mathcal{W}(x,t)$. The effects of changes in the structure-state process, the
reward structure, and the utilization interval on these measures of
performability for MRMs of the multi-processor system are investigated in the next two
sections.

2  Multi-processor  Model  Description 


We  begin  with  a  basic  Markov  reward  model  of  the  multi-processor 
system  and  then  indicate  a  set  of  changes  in  the  structure-state  process 
and  reward  structure.  The  measures  obtained  for  the  various  models  of  the 
multi-processor  system  are  listed  in  Table  1.  In  the  following  section,  each 
graph  plots  measures  for  a  sequence  of  illustrative  models. 

Determining the way changes in the reward structure and the
structure-state process affect measures of interest is crucial to using MRMs effectively
in the system design process. Efforts to change system behavior in a
favorable way must use the appropriate model and measure or they will be
ineffective. For example, consider adding a repair facility to a high reliability
non-repairable  system  (failure  rate  of  —  10-s).  The  steady-state  behavior 
will  change  radically.  However,  if  the  utilization  interval  is  short  (~  10 
hours)  then  the  repair  facility  will  not  substantially  change  the  availability 
over  the  10  hour  interval.  We  wish  to  indicate  some  situations  where  the 
distribution  of  accumulated  reward  or  its  time  average  will  indicate  behavior 
not  captured  by  other  measures.  We  briefly  describe  the  types  of  failure  and 
repair  behavior  of  the  multiprocessor  system  modeled  with  structure-state 
processes.  The  system  consists  of  16  processors,  16  memories,  and  an
interconnection  network  (i.e.,  crossbar  switch)  that  allows  a  processor  to  access
any  memory.  Since  the  system  we  analyze  is  similar  to  the  Carnegie-Mellon 
multi-processor  system,  C.mmp,  we  use  the  failure  data  from  that  system. 
Siewiorek  in  [24]  determined  the  failure  rates  per  hour  for  the  components 
to  be: 

                    Processor               Memory                 Switch
Failure Rates :     $\lambda = 0.0000689$   $\gamma = 0.0002241$   $\delta = 0.0002024$ .

Viewing the network as a single switch and modeling the system at the
processor-memory-switch (PMS) level, we see that the interconnection
network is essential for system operation. It is also clear that a minimum
number of processors and memories are necessary for the system to be
operational. We follow Siewiorek's choice of 4 processors, 4 memories and
1 interconnection network (switch) as the minimal operating configuration
required for handling a task. Each state is specified by a triple $(i, j, k)$
indicating the number of operational processors, memories, and networks,
respectively. We augment the states with a non-operational state F. Events
that decrease the number of operational components are associated with
failure, and events that increase the number of operational elements are
associated with repair. We assume that failures do not occur when the system
is not operational. When a component of the multi-processor system fails, a
recovery action must be taken (e.g., shutting down a failed processor, so that
it does not fill memories with spurious data), or the whole system will fail
and enter state F. The probability that the recovery action is successfully
completed is known as the coverage.

We consider two kinds of repair actions: global repair, which restores the
system to state (16, 16, 1) with rate $\mu = 0.2$ per hour from state F, and
local repair, which can be thought of as a repair person beginning to fix
a component of the system as soon as a component failure occurs. Our
model of local repair assumes that there is only one repair person for each
component type. We let the local repair rates per hour be:

                        Processor     Memory         Switch
Local Repair Rates :    $\nu = 2.0$   $\eta = 1.0$   $\epsilon = 0.5$ .

A further refinement of the structure-state process can be made with respect
to the interconnection network. Siewiorek in [25] notes that the C.mmp
interconnection network is actually implemented as a set of 16 fan-out switches
for each processor and memory port. In this case the failure rate of the
interconnection system with respect to some operational configuration $(i, j, 1)$
is simply $\delta(i,j) = (i + j) \times$ (fan-out switch failure rate + line failure rate).
Since the cause of failure is uniformly distributed over the fan-out switches
and their lines, we will simply let the failure rate associated with each fan-out
switch and line pair be 1/32 of the lumped failure rate of the switch. Thus
$\delta(10, 10) = 20 \times 0.000006325 = 0.0001265$. We are pessimistic in that we
assume that the failure of one fan-out switch and line brings the system to
a non-operational state (i.e., $(i, j, 0)$). The single or “lumped” network with
failure rate $\delta$ is more pessimistic than the “distributed” network with failure
rate $\delta(i,j)$. A Markov model of the structure-state process for the C.mmp
system with a “lumped” network and global repair has 170 states.
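
The arithmetic behind the “distributed” network failure rate can be reproduced directly from the numbers given in the text. In the sketch below the constant and function names are hypothetical; the values 0.0002024 and 1/32 come from the text.

```python
SWITCH_FAILURE_RATE = 0.0002024  # lumped switch failure rate per hour (from the text)
PORTS = 32                       # 16 processor ports + 16 memory ports

def distributed_network_failure_rate(i, j):
    """Network failure rate in configuration (i, j, 1) for the distributed model."""
    per_port = SWITCH_FAILURE_RATE / PORTS  # one fan-out switch and line pair
    return (i + j) * per_port

print(distributed_network_failure_rate(10, 10))  # about 0.0001265, as in the text
print(distributed_network_failure_rate(16, 16))  # equals the lumped rate with all ports in use
```

The second call shows why the lumped model is the more pessimistic one: the distributed rate only reaches the lumped rate when all 32 ports are operational and is smaller in every degraded configuration.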

The two variations of the structure-state process we consider for the
failure transitions are imperfect coverage (i.e., leakage to state F), and the
network failure rate (“lumped” or “distributed”). Local or global repair
actions are the two kinds of repair strategies investigated. The substantial
increase in model complexity that results from adding a local repair
capability is evident in Figure 2, which depicts the structure-state process of a
model with a “lumped” interconnection network, local and global repair,
and imperfect coverage (365 states). The lower plane in Figure 2 contains
the set of states where component exhaustion has occurred. Most of the
states (169) in the lower level are the result of the interconnection network
failing. Thirteen states represent system failure due to the exhaustion of
operational memories and thirteen more states represent system failure due
to the exhaustion of operational processors. The local repair models will
include both local and global repair. When we speak of a model with only
global repair, we set all local repair rates ($\nu$, $\eta$, $\epsilon$) in Figure 2 to zero
and merge all non-operational states with state F. The structure-state
processes of the MRMs of the multi-processor system thus can be characterized
by their failure type (coverage), interconnection network type (“lumped” or
“distributed”), and repair type (global or local).
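
The state counts quoted for the two lumped-network models (170 and 365) can be checked by enumerating the structure states. The sketch below assumes one particular labeling of the exhaustion states (three operational units of the exhausted component type), which is an illustrative guess and not taken from Figure 2; only the counts themselves come from the text.

```python
MIN_UP, TOTAL = 4, 16  # minimal operating configuration and full complement

def count_states(local_repair):
    """Number of structure states for the lumped-network model (illustrative count)."""
    levels = range(MIN_UP, TOTAL + 1)  # 4..16, i.e. 13 levels per component type
    operational = [(i, j, 1) for i in levels for j in levels]
    if not local_repair:
        # Global repair only: all non-operational states are merged into F.
        return len(operational) + 1
    network_down = [(i, j, 0) for i in levels for j in levels]
    # Hypothetical labels for the component-exhaustion states.
    memory_exhausted = [(i, MIN_UP - 1, 1) for i in levels]
    processor_exhausted = [(MIN_UP - 1, j, 1) for j in levels]
    return (len(operational) + len(network_down)
            + len(memory_exhausted) + len(processor_exhausted) + 1)  # + state F

print(count_states(local_repair=False))  # 169 operational states + F
print(count_states(local_repair=True))   # 169 + 169 + 13 + 13 + F
```

The enumeration makes the source of the growth explicit: local repair requires keeping track of which component was exhausted and how many of the others remain, which more than doubles the state space.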

It remains to present the reward structures we use to characterize the
performance behavior of the multi-processor system when it is in a given
structure state. The simplest reward structure is obtained by dividing the
structure states into two classes, operational and non-operational, and
assigning the reward rate 1.0 to the operational states and 0 to the rest. A
more accurate measure of system performance is more closely related to the
system's ability to do useful work. Because memory is the slowest resource
in the C.mmp system, the effectiveness of the system is limited by the
number of available memories. Thus if there are more memories than processors,
performance will still be limited by the memory bandwidth needed by the
processors, while if there are more processors than memories the
performance will be limited by the number of memories. A simple capacity-based
performance model of an operational structure-state $(i, j, 1)$ is to let the
associated reward rate be $\min\{i, j\}$. This performance model is optimistic
because it does not consider processors contending for the memories.

When we consider contention for the memories, we use a model
developed by Bhandarkar [2] to obtain the average number of busy memories or
memory bandwidth. Bhandarkar found the average number of busy
memories, and hence the reward rate in an operational state $(i, j, 1)$, to be:

$$ r_{i,j,1} = m\left(1 - (1 - 1/m)^{l}\right) , \qquad (2) $$

where $l = \min\{i, j\}$ and $m = \max\{i, j\}$. We assign a zero reward rate to
each non-operational state. Hence, in addition to a variety of structure-state
processes we also have three reward structures of interest: the
availability-based reward structure (0, 1), the capacity-based reward structure ($\min\{i, j\}$),
and the contention-based reward structure (equation (2)).
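
The three reward structures can be written as small functions of the structure state. This is a sketch under two assumptions: the operational test uses the minimal configuration of 4 processors, 4 memories and 1 network stated earlier, and $m$ in equation (2) is taken to be the larger of $i$ and $j$ (the extracted text leaves the definition of $m$ blank). All function names are hypothetical.

```python
MIN_UP = 4  # minimal number of processors and of memories for an operational system

def is_operational(i, j, k):
    return i >= MIN_UP and j >= MIN_UP and k == 1

def availability_reward(i, j, k):
    """Reward 1 in operational states, 0 otherwise."""
    return 1.0 if is_operational(i, j, k) else 0.0

def capacity_reward(i, j, k):
    """Optimistic capacity-based reward: min{i, j} in operational states."""
    return float(min(i, j)) if is_operational(i, j, k) else 0.0

def contention_reward(i, j, k):
    """Contention-based reward of equation (2); m = max{i, j} is an assumption."""
    if not is_operational(i, j, k):
        return 0.0
    l, m = min(i, j), max(i, j)
    return m * (1.0 - (1.0 - 1.0 / m) ** l)

print(contention_reward(16, 16, 1))  # about 10.303, the fully operational reward rate
print(capacity_reward(16, 16, 1))    # 16.0, the optimistic value without contention
```

For the fully operational state the contention model gives roughly 10.3 busy memories instead of 16, which quantifies how optimistic the capacity-based structure is.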

The initial state of the system in all our models will be (16, 16, 1) except
in section 3.3, where $\underline{p}(0)$, the initial state probability vector, is equal to
the steady-state probability vector, $\underline{\pi}$. The effects of changes in
utilization-interval length, structure-state process, and reward structure for the
multi-processor MRMs are examined in the next section.

3  Multi-processor  Performability  Results 

3.1  The effects of coverage and utilization interval on $E[X(t)]$
and $E[W(t)]$, functions of $\underline{p}(t)$.

First, we use a sequence of models that illustrate the way the completion
property affects $E[X(t)]$ and $E[W(t)]$ as a function of time in Figures 3
and 4, respectively. In both Figure 3 and Figure 4, we use our
contention-based performance model to obtain the reward structure. $E[X(t)]$ is the
expected instantaneous reward at time $t$ and has been called the computation
availability in [1]. This measure answers the question, “What is the expected
performance of the system at time $t$?”. $E[W(t)]$ is the expected
time-averaged accumulated reward over the interval $(0, t)$. $E[W(t)]$ answers the
question, “What is the time-averaged performance of the system over the
interval $(0, t)$?”. In Figures 3 and 4 we let curve I be a 'pure' performance
model of the state (16, 16, 1). The 'pure' performance model does not have
any failures so the system performance is independent of time. With memory
contention but no failure, the reward rate is 10.303 and both $E[X(t)]$ and
$E[W(t)]$ are 10.303 for all time $t$. In curve II, only component failures occur
($c = 1$ for coverage), and we see that the expected performance level has been
halved at time $t = 2000$. At time $t = 2000$ in Figure 4, the expected
time-averaged accumulated reward has decreased by only one quarter because
$E[W(t)]$ is the time average of $E[X(\tau)]$ over $(0, t)$. Thus $E[W(t)]$ is
relatively insensitive, for large $t$, to the state of the system at a particular
instant $\tau \le t$. Both Figure 3 and Figure 4 show the importance of the
completion property. Models with the completion property (curves I, IV
and V) strongly dominate those without it (curves II and III), indicating the
value of global repair for long utilization intervals.

In curves III and V the coverage is reduced to 0.9. Curve III, like curve II,
has no repair, and the expected performance level of curve III deteriorates
more rapidly than that of curve II. Curve IV has only component failures
($c = 1$), and global repair as well. Consequently, the expected performance
level of curve IV is much improved over that of curve II, especially for large
$t$. One might expect curve IV to dominate curve V, which uses the same
model as curve IV with $c = 0.9$, just as curve II dominates curve III.
However, for large $t$ it is better on the average to experience a coverage
failure and rapidly return to the highest reward state (16, 16, 1) rather than
spend a long interval in the relatively low reward states before returning to
structure state (16, 16, 1). Both of these measures indicate the importance
of global repair for longer time intervals.

Unfortunately, $E[X(t)]$ and $E[W(t)]$ do not address the likelihood of
completing a given amount of work in a specified interval. $E[W(t)]$ merely
gives an indication of the average behavior over a utilization interval. We use
$Y^c(x,t)$ to examine the behavior of a non-repairable system over utilization
intervals of different lengths in the next section.

3.2 The effect of utilization interval on $Y^c(x,t)$ for non-repairable
models.

We consider a model of the C.mmp system with a “lumped” interconnection
network, c = 0.90 for the coverage, and no repair in Figure 5. The
CTMC of the structure-state process is depicted in Figure 2 with all repair
rates set equal to zero. The reward structure is based on the
contention-based performance model. Curves I, II, III, IV plot the value of
$Y^c(x,t)$ for t = 100, 1000, 10000, and $\infty$ respectively.

Loosely speaking, $Y^c(x,t)$ answers the question, “What is the probability
that x units of work is completed by time t?” Because the model does
not have the completion property, $Y^c(x,t)$ is substantially less than 1.0 for
moderate amounts of accumulated reward even if $t \to \infty$. It is interesting
to note that $Y^c(x,t)$ for moderate t only falls below $\lim_{t\to\infty} Y^c(x,t)$ as
x  — *  ~  9 1. 
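
The two observations above (a limit below 1.0, and finite-t curves that coincide with the limit until x gets large) already appear in the simplest non-repairable model. The sketch below assumes a hypothetical single operational state with reward rate `r` and an exponential time to an absorbing failure with rate `lam`, so that the accumulated reward is Y(t) = r * min(T, t); it is an illustration under those assumptions, not the model of Figure 5.

```python
import math

def complementary_distribution(x, t, r, lam):
    """P[Y(t) > x] when reward accrues at rate r until an Exp(lam) failure."""
    if x < 0:
        return 1.0
    if x >= r * t:                   # at most r*t can be accumulated by time t
        return 0.0
    return math.exp(-lam * x / r)    # the system must survive for x/r time units

def asymptotic_limit(x, r, lam):
    """Limit of P[Y(t) > x] as t tends to infinity."""
    return math.exp(-lam * x / r) if x >= 0 else 1.0

r, lam = 10.0, 0.001                 # assumed illustrative values
for x in (500.0, 1500.0):
    print(x, complementary_distribution(x, 100.0, r, lam), asymptotic_limit(x, r, lam))
```

For x = 500 the value at t = 100 equals the asymptotic limit, which is itself below 1.0; for x = 1500 the finite-t value is zero because that much reward cannot be accumulated in 100 time units.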

The non-repairable system performs near its asymptotic limit, $Y^c(x,\infty)$,
for moderate t. However, systems that satisfy the completion property will
complete any finite amount of work in an arbitrarily long utilization interval.
When comparing different systems for the same utilization interval,
$Y^c(x,t)$ is quite satisfactory, whether the system satisfies the completion
property or not. If we wish to compare the behavior of systems that satisfy
the completion property over different utilization intervals, then we need to
normalize the curves of the different complementary distributions of
accumulated reward so that they can be compared over the same interval. The
natural approach is to time average the accumulated reward and use $W(t)$
as the random variable rather than $Y(t)$. In the next section we examine
the behavior of a system that satisfies the completion property over different
utilization intervals. The results are rather surprising.

3.3 The effect of utilization interval on $W^c(x,t)$ for models
with the completion property.

As noted in section 3.1, both $E[X(t)]$ and $E[W(t)]$ are functions of
the instantaneous probability vector, $p(t)$. If we let the initial probability
vector, $p(0)$, of the system equal the steady-state probability vector, $\pi$, then
neither $E[X(t)]$ nor $E[W(t)]$ will change since then $p(t) = \pi$ for all t. We
show the presence of behavior not detected by these measures in Figure 6.
In Figure 6 we use a structure-state process with global repair and coverage
= 0.95 to model the failure and repair activity of the C.mmp system and
the contention-based performance model to obtain the reward structure.
The measure $W^c(x,t)$ can be used to answer the question, “What is the
probability that the reward accumulated in the interval (0, t) is at least xt?”

We examine the distribution of time-averaged memory bandwidth (performance)
for utilization intervals of length 10, 100, 1000, and 10000 in
curves I, II, III and IV of Figure 6. We indicate the steady-state expected
reward rate, $\sum_i r_i \pi_i$, with a vertical line labeled V (+ + +). We can see
the way the curve smooths out and approaches a jump at the steady-state,
time-averaged reward rate as t increases. The dynamic behavior of the system
in steady state is indicated in Figure 6. Measures such as $E[X(t)]$
and $E[W(t)]$ are unable to capture the steady-state system dynamics since
both these measures are invariant with respect to time for the Markov
reward model with $p(0) = \pi$.
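
The point that the mean stays fixed while the distribution of the time average still changes with t can be checked by simulation. The following is a Monte Carlo sketch, not the transform algorithm of this paper, on a hypothetical two-state repairable model (failure rate `lam`, repair rate `mu`, reward `r_up` while operational, 0 otherwise) started in steady state; all parameter values are assumptions chosen for illustration.

```python
import random

def sample_time_average(t, lam, mu, r_up, rng):
    """One sample of W(t) = Y(t)/t for a two-state model started in steady state."""
    up = rng.random() < mu / (lam + mu)       # draw the initial state from pi
    remaining, reward = t, 0.0
    while remaining > 0.0:
        rate = lam if up else mu
        stay = min(rng.expovariate(rate), remaining)
        if up:
            reward += r_up * stay             # reward accrues only while operational
        remaining -= stay
        up = not up
    return reward / t

def mean_and_variance(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance

rng = random.Random(12345)                    # fixed seed for repeatability
lam, mu, r_up = 1.0, 9.0, 1.0
steady_state_rate = r_up * mu / (lam + mu)    # sum_i r_i * pi_i for this model
results = {}
for t in (1.0, 100.0):
    samples = [sample_time_average(t, lam, mu, r_up, rng) for _ in range(2000)]
    results[t] = mean_and_variance(samples)
    print(t, results[t], steady_state_rate)
```

Both sample means should lie close to the steady-state reward rate, while the sample variance for the longer interval should be much smaller, which is the concentration around the steady-state value described above.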


3.4 The effect of reward structure and model “family” on
$W^c(x,t)$ for models with the completion property.


Insight into the way the structure-state process and the reward structure
affect the ability of the multi-processor system to complete a fixed
amount of work in a given time interval (0, t) is obtained from the
complementary distribution of time-averaged accumulated reward. We plot
the complementary distribution of time-averaged accumulated reward (in
this case the time-averaged memory bandwidth) for a basic Markov reward
model with a “lumped” interconnection network, perfect coverage
(c = 1), and global repair. We use “pure” performance models to provide
an optimistic upper bound for MRMs, comparing the capacity-based
and contention-based reward structures resulting from the different
performance assumptions about the way memory is accessed. We examine the
distribution of $W(t)$ and the distribution of the interval availability, $A_I(t)$,
in Figure 7. First we consider the system without failure and repair in
curves I and II. The result of this modeling assumption is that no
degradation of performance takes place and the state of the system is always
(16, 16, 1). Consequently, curves I and II of the complementary distribution
of time-averaged accumulated reward are step functions. If we ignore
memory contention, then there are 16 processors and 16 memories and the
memory bandwidth is 16. It follows that the system performance level
(reward rate) is constant and $W(t) = 16$. Curve I in Figure 7 depicts this unit
step form of the complementary distribution of time-averaged accumulated
reward. For curve II, we assume that there is contention at the memories.
The result of modeling the contention is to lower the ability of the system
to deliver useful work. Therefore, the step for curve II occurs at a smaller
value of accumulated reward per unit time than the step for curve I. We use
the work of Bhandarkar [2] to estimate the effect of contention on the
performance of state (16, 16, 1). Hence, with memory contention but no failure,
the reward rate is 10.303 for all time t. Thus, in curve II the complementary
distribution of $W(t)$ is a unit step, though at 10.303 instead of 16 as in curve I.

In curves III and IV, we examine the effect of modeling failure and repair
on the complementary distribution of time-averaged accumulated memory
bandwidth. For curve III, we assume there is no memory contention and
use the capacity-based performance model for each structure state in
Figure 2 (assuming no local repair and merging all non-operational states in
F). The performance (reward rate or memory bandwidth) of operational
state (i, j, 1) is set to min(i, j). The gap between curve I and curve III
reflects the fact that when failure and repair are taken into account, memory
bandwidth varies with time, thus lowering the time-averaged accumulated
memory bandwidth. Without the occurrence of failures, the memory
bandwidth stays constant at 16. Another way of stating the situation is to say
that curve III will asymptotically approach curve I as the maximum of all
the failure rates tends to zero. Curve IV has a similar relationship to curve
II. In curve IV, we use our most detailed performance model, and take into
account failure and repair. Thus the performance level (reward rate) of each
operational state (i, j, 1) is determined by equation (2). Because the
performance degradation due to component failure is smaller with Bhandarkar’s
performance estimates than with the capacity-based performance estimates,
curve IV more closely approaches curve II than curve III approaches curve
I. The relationship of the 4 curves discussed indicates that the performance
model assumptions set an upper limit on the system’s ability to complete
work. The magnitudes of the failure and repair rates affect the rate at which
the complementary distribution of time-averaged memory bandwidth
declines below the step function defined by the performance model.

We see that ‘pure’ performance models overestimate the ability of the
system to complete useful work. For example, using curve IV, we see that
the probability that the time-averaged memory bandwidth is greater than
or equal to 9.5 is 0.989, whereas using the ‘pure’ performance-based model
of curve II, this probability is 1. It is also true that ‘pure’ failure/repair
(availability) models in which the reward rates for operational states are set
to 1.0 and non-operational states are assigned reward 0.0 underestimate the
ability of a system to complete useful work when the performance levels are
scaled in such a way that the minimum reward operational state has a reward
rate > 1.0. Using this reward structure, $W(t)$ is the interval availability
$A_I(t)$. To complete the set of reward structures considered for performability
models of the multi-processor system, with curve V we display the
complementary distribution of interval availability. We see that curve V is
nearly a step function at 1.0 because only a network failure will cause the
system to immediately enter state F (13 processor or 13 memory failures
must occur before the system will enter state F).
3.5 The effect of coverage and utilization interval on $W^c(x,t)$
for models with the completion property.

In this section we continue examining a model of the multi-processor
system with a “lumped” interconnection network, global repair, and different
coverage values. We will use the most accurate performance model, namely
Bhandarkar’s, to obtain the reward structure for the operational states. In
Figure 8 we show the effect of coverage and of the observation period on
the chosen measure of effectiveness. As $t \to 0$, independent of c, $W(t)$
approaches the ‘pure’ performance behavior shown in curve I, a step at
10.303. Curves II, III, and IV (c = 1.0, 0.95, and 0.90, respectively) show
that for larger observation intervals (t = 100), the higher coverage curves
dominate the lower coverage curves, illustrating the effect of coverage on
the complementary distribution of time-averaged accumulated reward. In
curves V-VII of Figure 8, we plot the steady-state computation availability,
$\lim_{t\to\infty} E[W(t)] = \sum_i r_i \pi_i$, for the different coverage values, where $\pi_i$ is the
steady-state probability of being in state i. We can see that as the length
of the utilization interval increases the probability of accumulating a given
amount of reward becomes more pessimistic. One cause of this effect is that
repair takes place only when the whole system has become inoperable and
the failure rates are small enough to make the occurrence of more than one
failure in a relatively small interval (100 hours) extremely unlikely. States
in which a significant number of failures have occurred become more likely
as time passes. Allowing repair only when the system has failed yields the
relative position of curves V-VII (c = 1.0, 0.95, and 0.90, respectively). As
the coverage probability decreases, the steady-state computation availability
actually increases!

The anomalous behavior of the steady-state computation availability is
caused by several factors: the disparity in the reward rates of the operational
states; the relatively large global repair rate in relation to failure rates; and
the assumption that the global repair rate is independent of the number of
failed components. If we set the reward rates for all the operational states to
be 1.0 and 0 otherwise, then the availability ($\sum_i r_i \pi_i$) decreases as the
coverage decreases (the anomaly disappears). Similarly, the anomaly disappears
if we make the global repair rate comparable to the failure rates or make it
dependent on the actual number of components that have failed. Also, local
repair causes the anomaly to disappear. The point is that extrapolating
from steady-state values and expected values can be misleading.
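
The anomaly can be reproduced with a model far smaller than the one used for Figure 8. The sketch below assumes a hypothetical three-state chain: `both_up` (fails at rate 2*lam, covered with probability c into `one_up`, otherwise directly into `failed`), `one_up` (fails at rate lam into `failed`), and `failed` (global repair at rate mu back to `both_up`). The state names, rates, and reward values are illustrative assumptions, not taken from the C.mmp model.

```python
def steady_state(c, lam, mu):
    """Steady-state probabilities of the hypothetical three-state model."""
    # Unnormalized solution of the balance equations, relative to pi(both_up):
    #   lam * pi(one_up) = 2 * lam * c * pi(both_up)
    #   mu  * pi(failed) = 2 * lam * pi(both_up)
    weights = {"both_up": 1.0, "one_up": 2.0 * c, "failed": 2.0 * lam / mu}
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}

def steady_state_reward(c, lam, mu, rewards):
    """sum_i r_i * pi_i for the given reward structure."""
    pi = steady_state(c, lam, mu)
    return sum(rewards[state] * pi[state] for state in pi)

lam, mu = 0.001, 1.0      # global repair is fast relative to failures
performance = {"both_up": 2.0, "one_up": 1.0, "failed": 0.0}
availability = {"both_up": 1.0, "one_up": 1.0, "failed": 0.0}
for c in (1.0, 0.9):
    print(c,
          steady_state_reward(c, lam, mu, performance),
          steady_state_reward(c, lam, mu, availability))
```

With the disparate performance rewards the steady-state value rises when c drops from 1.0 to 0.9, while with the 0/1 availability rewards it falls, matching the explanation above.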

3.6 The effect of interconnection network type and repair
capabilities on $W^c(x,t)$ for models with the completion
property.

We examine the effect of adding a local repair facility for each component
type to the multi-processor system in this section. In Figure 9 we
obtain $W^c(x,t)$ for two pairs of models with a utilization interval of 100
hours. In curve I we plot $W^c(x,t)$ for a model of the multi-processor system
with both global and local repair, c = 0.90 for coverage and a distributed
interconnection network (+ + +). The model used to obtain curve II
(solid) is the same as that used for curve I except for a “lumped” instead of
distributed interconnection network. The effect of the slightly lower failure
rate of the distributed interconnection network is small but discernible.

Curve III is the plot of $W^c(x,t)$ for the same model as curve I without
local repair. Replacing the distributed interconnection network in curve
III with the slightly more failure prone “lumped” interconnection network
model produces curve IV. The difference between curves III and IV is also
quite small. The effect of the local repair facility is sizable for time-averaged
memory bandwidth requirements greater than 9.2. This result indicates the
value of local repair for the multi-processor system over even moderately
sized utilization intervals. Another way of expressing the situation is to
observe that as the time-averaged workload requirement increases, the size
of the utilization interval where the local repair facility will substantially
increase $W^c(x,t)$ becomes smaller. Roughly speaking, we can conclude that
local repair is worthwhile for systems expected to operate at nearly full
capacity (maximum reward rate), even if the utilization interval is of only
moderate size.

4  Conclusion 

The ability to determine the distribution of accumulated reward and
its time average for moderate size problems is a recent development. We
presented a systematic study of a complex multi-processor system and an
$O(n^3)$ algorithm for the computation of the distribution of accumulated
reward of general Markov reward models.

The study of Markov reward models of the multi-processor system points
to a number of interesting facts about different performability measures.
The first three examples indicate that instantaneous measures do not show
the dynamic behavior of the system while $W^c(x,t)$ does. The next two
examples show that steady-state values are deceptive in some circumstances,
and the final example indicates the importance a local repair facility may
have on the distribution of performability for moderate-size utilization
intervals. Furthermore, the study indicates how changes in the failure/repair
behavior of the system such as the interconnection network failure rate,
repair strategy, and coverage probability affect the complementary
distribution of accumulated reward. We also examine the way changing the reward
structure and utilization interval affects the distribution of time-averaged
accumulated reward. Thus some inadequacies of steady-state values and
expected values are illustrated and an examination of how changes in Markov
reward models affect the performability distribution is made. The new
algorithm presented in the paper can thus aid the system designer in exploring
detailed dynamic behavior of multi-processor systems.

Figure 3: Expected Instantaneous Memory Bandwidth $E[X(t)]$ Vs. time t.

Figure 4: Time-Averaged Expected Accumulated Bandwidth $E[W(t)]$ Vs.
time t.

Figure 5: Complementary Distribution of Accumulated Bandwidth $Y^c(x,t)$
for Different t Values. (Axis label: P[ Y(t) > x ].)

Figure 6: Complementary Distribution of Time-Averaged Accumulated
Bandwidth $W^c(x,t)$ for Different t Values.

Figure 7: Complementary Distribution of Time-Averaged Accumulated
Bandwidth for Different Reward Structures. (Axis label: P[ W(t) > x ], t = 100.)

Figure 8: Complementary Distribution of Time-Averaged Accumulated
Bandwidth for Different Coverage Values.

Figure 9: Complementary Distribution of Time-Averaged Accumulated
Bandwidth $W^c(x,t)$ for Different Networks and Different Repair Policies.

A  An  Algorithm  and  Its  Analysis 

In this section, we detail the double-transform inversion method for the
distribution of accumulated reward. We begin with:

$$Y_i(x,t) = P[\, Y(t) \le x \mid Z(0) = i \,].$$

First, we apply the LST (i.e., $\int_0^\infty e^{-ux}\, dY_i(x,t)$), signified by ~, to $Y_i(x,t)$
with respect to the work requirement x (transform variable u), and then
apply the Laplace Transform, signified by *, with respect to time t (transform
variable s). The following linear system has been derived for $\tilde{Y}^{*}(u,s)$ in
[22] and [18]:

$$(sI + uR - Q)\, \tilde{Y}^{*}(u,s) = e. \qquad (3)$$

The matrix of reward rates is $R = \mathrm{diag}(r_1, r_2, \ldots, r_i, \ldots, r_n)$, Q is the
generator matrix of the CTMC, and e is a column vector of size n with all
elements equal to 1.
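
To make the linear system (3) concrete, the sketch below solves it for a hypothetical two-state model (operational state with reward `r` and failure rate `lam`, absorbing failed state with reward 0) and compares the result with the closed form that follows directly from Y(t) = r * min(T, t). The plain Gaussian elimination routine and all names are assumptions made for illustration; they are not the solver used in this paper.

```python
def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    y = [0.0] * n
    for row in range(n - 1, -1, -1):                        # back substitution
        tail = sum(a[row][k] * y[k] for k in range(row + 1, n))
        y[row] = (a[row][n] - tail) / a[row][row]
    return y

def double_transform(u, s, Q, rewards):
    """Solve (sI + uR - Q) y = e, i.e. equation (3), for the vector y."""
    n = len(rewards)
    system = [[(s + u * rewards[i] if i == j else 0.0) - Q[i][j]
               for j in range(n)] for i in range(n)]
    return solve(system, [1.0] * n)

lam, r = 0.01, 10.0
Q = [[-lam, lam], [0.0, 0.0]]      # generator: operational -> failed (absorbing)
rewards = [r, 0.0]
u, s = 0.5, 2.0
y = double_transform(u, s, Q, rewards)
closed_form = [(1.0 + lam / s) / (s + u * r + lam), 1.0 / s]
print(y, closed_form)
```

The second component equals 1/s because no reward is ever accumulated from the failed state, which is a quick sanity check on the sign conventions in (3).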

Using Cramer’s rule, we can see that $\tilde{Y}_i^{*}(u,s)$ is a rational function in
s. Hence, it has a partial fraction expansion:

$$\tilde{Y}_i^{*}(u,s) = \sum_{j=1}^{d} \sum_{k=1}^{m_j} a_{ijk}(u)\, (s - \lambda_j(u))^{-k} \qquad (4)$$

where the $\lambda_j(u)$, $j = 1, 2, \ldots, d$, are the d distinct eigenvalues of
$(Q - uR)$, each with algebraic multiplicity $m_j$. The QR algorithm [27] is
used to numerically determine the eigenvalues of $(Q - uR)$ in $O(n^3)$ time. Using
(4) we can invert analytically with respect to s and obtain:

$$\tilde{Y}_i(u,t) = \sum_{j=1}^{d} \sum_{k=1}^{m_j} a_{ijk}(u)\, \frac{t^{k-1}}{(k-1)!}\, e^{\lambda_j(u)\, t}.$$
We define the column vector $A_i$ by:

$$A_i = \begin{bmatrix} a_{i11} \\ \vdots \\ a_{i1m_1} \\ \vdots \\ a_{idm_d} \end{bmatrix}$$

and the row vector $E^T(s)$ by:

$$E^T(s) = \left[\, (s - \lambda_1(u))^{-1} \;\cdots\; (s - \lambda_1(u))^{-m_1} \;\cdots\; (s - \lambda_d(u))^{-m_d} \,\right].$$

Now equation (4) can be written in vector notation as:

$$\tilde{Y}_i^{*}(u,s) = E^T(s)\, A_i .$$


In order to determine $A_i$ we need n linearly independent equations. For
this purpose we choose n distinct values of s, denoted by $s_1, s_2, \ldots, s_n$,
sufficiently separated from the eigenvalues of $(Q - uR)$. The matrix E is
constructed from the eigenvalues of $(Q - uR)$ and the n values of s:

$$E = \begin{bmatrix} E^T(s_1) \\ E^T(s_2) \\ \vdots \\ E^T(s_n) \end{bmatrix}.$$

A way of choosing the n values of s so that E has a reasonably small
condition number is to choose $s_j$ so that the jth element of $E^T(s_j) \approx 1.0$. This
causes the diagonal elements of E to be reasonably large, although it does
not guarantee E is non-singular. If E is found to be singular, then a new s
value can easily be chosen at that time. We then solve the linear system

$$E A_i = \begin{bmatrix} \tilde{Y}_i^{*}(u,s_1) \\ \vdots \\ \tilde{Y}_i^{*}(u,s_n) \end{bmatrix} \qquad (5)$$


for the unknown vector $A_i$ once the right-hand side has been determined.
Since the problems we consider are of small size (for the solution of a
linear system) we use a direct method (LU factorization) on E. The $O(n^3)$
LU factorization must be done only once and an $O(n^2)$ backsolve must
be done for each of the n possible right-hand sides. Thus we may solve
system (5) for all n values of i for a cost that is $O(n^3)$. The practical
implication of this fact is that the $\tilde{Y}_i(u,t)$ may be obtained at $O(n^3)$
cost if the n different right-hand sides can be obtained at an $O(n^3)$ cost as
well.
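
The sampling step and the term-by-term inversion in s can be illustrated on a hypothetical two-state model (operational state with reward `r` and failure rate `lam`, absorbing failed state), for which $(Q - uR)$ has the two simple eigenvalues $-(\lambda + ur)$ and 0. In this sketch the right-hand sides are taken from the known closed form of the double transform instead of being computed from system (3), and a small dense solver stands in for the LU factorization; both are simplifying assumptions.

```python
import math

def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n + 1):
                a[row][k] -= factor * a[col][k]
    y = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * y[k] for k in range(row + 1, n))
        y[row] = (a[row][n] - tail) / a[row][row]
    return y

lam, r, u = 0.01, 10.0, 0.5
eigenvalues = [-(lam + u * r), 0.0]     # eigenvalues of Q - uR for the toy model

def transform_value(s):
    """Closed-form double transform for the operational state of the toy model."""
    return (1.0 + lam / s) / (s + u * r + lam)

# Choose s_j = lambda_j + 1 so that the j-th element of E^T(s_j) equals 1.
sample_points = [ev + 1.0 for ev in eigenvalues]
E = [[1.0 / (s - ev) for ev in eigenvalues] for s in sample_points]
coefficients = solve(E, [transform_value(s) for s in sample_points])

def transform_in_u(t):
    """Invert each term a_j / (s - lambda_j) to a_j * exp(lambda_j * t)."""
    return sum(a * math.exp(ev * t) for a, ev in zip(coefficients, eigenvalues))

print(coefficients, transform_in_u(3.0))
```

For repeated eigenvalues the columns of E would also contain higher powers of $(s - \lambda_j)^{-1}$, which this sketch does not cover.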

Since we are interested in the n vectors of partial fraction coefficients
$A_i$, $1 \le i \le n$, let us define

$$A = (A_1\; A_2\; \cdots\; A_n), \qquad Y = (\tilde{Y}_1^{*}(u,\mathbf{s})\;\; \tilde{Y}_2^{*}(u,\mathbf{s})\;\; \cdots\;\; \tilde{Y}_n^{*}(u,\mathbf{s})),$$

where $\tilde{Y}_i^{*}(u,\mathbf{s})$ is the column vector with entries $\tilde{Y}_i^{*}(u,s_l)$, $1 \le l \le n$.
Hence the problem of determining the $n^2$ partial fraction coefficients can be
written in matrix form as:

$$EA = Y. \qquad (6)$$

Because of the economical way direct methods handle multiple right-hand
sides we need only address the problem of determining all the $\tilde{Y}_i^{*}(u,s_l)$
that make up Y in $O(n^3)$ time.

We first transform the linear system (3) to a simpler one. Any matrix
can be put into upper Hessenberg form using a sequence of Householder
unitary similarity transformations [27]. Therefore we can write

$$U^{\dagger} = U^{-1} \quad \text{and} \quad U^{\dagger}(Q - uR)\,U = H ,$$

where $\dagger$ denotes conjugate transpose and H is an upper Hessenberg matrix.
By making the transformation $U^{\dagger}\tilde{Y}^{*}(u,s) = \tilde{M}^{*}(u,s)$ the linear system
(3) can be rewritten as

$$(sI - H)\,\tilde{M}^{*}(u,s) = U^{\dagger} e .$$

This upper Hessenberg linear system requires only $O(n^2)$ time to determine
$\tilde{M}^{*}(u,s)$. $\tilde{Y}^{*}(u,s)$ can be regained from $\tilde{M}^{*}(u,s)$ by a matrix vector
product:

$$\tilde{Y}^{*}(u,s) = U\,\tilde{M}^{*}(u,s) ,$$

which also costs $O(n^2)$. An important observation here is that the required
sequence of unitary transformations (U) and the matrix H are already
available from the QR algorithm that solves the eigenvalue problem for $(Q - uR)$
and hence do not add any cost. Thus the complexity to obtain $\tilde{Y}^{*}(u,s)$
is now only $O(n^2)$ for every value of s for each u.
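
The $O(n^2)$ cost comes from the fact that an upper Hessenberg matrix has only one nonzero entry below the diagonal in each column. The following is a minimal sketch of such a solve, using row elimination with pivoting between adjacent rows; the function name and the small test matrix are illustrative assumptions, and a production code would normally rely on a tested numerical library.

```python
def solve_hessenberg(h, b):
    """Solve h * x = b for an upper Hessenberg matrix h in O(n^2) operations."""
    n = len(b)
    a = [list(row) for row in h]
    rhs = list(b)
    for k in range(n - 1):
        # Only the single subdiagonal entry a[k+1][k] has to be eliminated.
        if abs(a[k + 1][k]) > abs(a[k][k]):          # pivot between adjacent rows
            a[k], a[k + 1] = a[k + 1], a[k]
            rhs[k], rhs[k + 1] = rhs[k + 1], rhs[k]
        factor = a[k + 1][k] / a[k][k]
        for j in range(k, n):
            a[k + 1][j] -= factor * a[k][j]
        rhs[k + 1] -= factor * rhs[k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                   # back substitution
        tail = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (rhs[i] - tail) / a[i][i]
    return x

h = [[4.0, 1.0, 2.0],
     [3.0, 5.0, 1.0],
     [0.0, 2.0, 6.0]]
b = [12.0, 16.0, 22.0]        # equals h times the vector [1, 2, 3]
solution = solve_hessenberg(h, b)
print(solution)
```

Because Python arithmetic and `abs` also accept complex numbers, the same routine can be applied to a system such as $(sI - H)$ with complex s or u, which is the situation in the text.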

It remains to invert $\tilde{Y}_i(u,t)$ with respect to u. A number of methods to
numerically invert the Laplace transform have been developed. Orthogonal
polynomials [26] and Fourier series [4], [5] have been the most commonly used
tools for inverting the Laplace Transform. To avoid unnecessary notational
complexity we define $V(u) \equiv \tilde{Y}_i(u,t)/u = Y_i^{*}(u,t)$ and to follow standard
notation let $i = \sqrt{-1}$ in the next two equations. We employ the following
method to numerically obtain v(x), the inverse Laplace Transform of V(u),
using the well known complex inversion formula

$$v(x) = \frac{1}{2\pi i} \int_{a - i\infty}^{a + i\infty} e^{ux}\, V(u)\, du = \frac{e^{ax}}{\pi} \int_{0}^{\infty} \Re\{ V(u)\, e^{i\omega x} \}\, d\omega ,$$

where $u = a + i\omega$. If the above integral is now discretized using the
trapezoidal rule with step size $\pi/T$, the following Fourier series approximation
$\hat{v}(x)$, of period 2T, is obtained:

$$\hat{v}(x) = \frac{e^{ax}}{T} \left[ \frac{V(a)}{2} + \sum_{k=1}^{\infty} \left\{ \Re\!\left( V\!\left( a + \frac{ik\pi}{T} \right) \right) \cos\!\left( \frac{k\pi x}{T} \right) - \Im\!\left( V\!\left( a + \frac{ik\pi}{T} \right) \right) \sin\!\left( \frac{k\pi x}{T} \right) \right\} \right] \qquad (7)$$

The discretization error declines exponentially as aT increases [7]:

$$\hat{v}(x) - v(x) = \sum_{k=1}^{\infty} e^{-2kaT}\, v(2kT + x); \qquad 0 < x < 2T .$$
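
A direct evaluation of the truncated series (7) shows how the approximation behaves on a transform with a known inverse. The sketch below assumes plain term-by-term summation with a large number of terms and no convergence acceleration, so it is a simplified stand-in for the accelerated procedure used in this paper; the test transform, the parameter values `a` and `T`, and the function names are illustrative choices.

```python
import math

def fourier_series_inverse(V, x, a, T, terms):
    """Truncated form of approximation (7) for the inverse Laplace transform at x."""
    total = V(complex(a, 0.0)).real / 2.0
    for k in range(1, terms + 1):
        omega = k * math.pi / T
        value = V(complex(a, omega))
        total += value.real * math.cos(omega * x) - value.imag * math.sin(omega * x)
    return math.exp(a * x) / T * total

def V(u):
    """Test transform: Laplace transform of v(x) = 1 - exp(-x)."""
    return 1.0 / (u * (u + 1.0))

x, a, T = 1.0, 0.8, 10.0      # aT = 8 keeps the discretization error near exp(-16)
approx = fourier_series_inverse(V, x, a, T, terms=20000)
exact = 1.0 - math.exp(-x)
print(approx, exact)
```

The terms of this particular series decay like $1/k^2$, so plain summation needs many terms for modest accuracy, which is the practical motivation for an acceleration method when each evaluation of V(u) is expensive.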

Since $v(x) \le 1.0\ \forall x$, the discretization error is easily made very small.
Therefore, the bulk of the error in the numerical inversion procedure
accrues from truncating the Fourier series. The Fourier series exhibits
characteristically slow convergence. However, acceleration methods that allow
accurate estimates of the series from the first m terms are known. We use
the quotient-difference algorithm of Rutishauser with a remainder estimate
suggested by DeHoog et al. in [5] to accelerate the convergence of the Fourier
series. Cooley et al. in [4] use the cosine transform to approximate a series
very similar to (7), and Jagerman [17] obtains an expression similar in form
to (7) by considering the generating function of a sequence of functionals
that converge in the limit to v(x). Because of the $O(n^3)$ cost of computing
each function value, the method that reliably yields accurate results with the
fewest evaluations is best. We have been pleased with the results obtained
when the Fourier series is evaluated to the first m = 80 terms with the
DeHoog remainder estimate (even when the desired distribution has jumps at
various values of x). The structure of the overall algorithm is as follows:

A: Determine $\tilde{Y}_i(u,t)$

    for (m values of u) {
        determine the eigenvalues of (uR - Q)
        for (d unique eigenvalues of (uR - Q)) {
            solve transformed Hessenberg system
        }
        evaluate partial fraction coefficients
    }

B: Numerical Laplace Transform Inversion

    for (n states) {
        for (p desired values of t) {
            for (m values of u) {
                sum partial fraction coefficients to evaluate V(u)
            }
            for (q values of x) {
                sum Fourier series approximation to evaluate v(x)
            }
        }
    }

In the worst case, the inner loop of phase A of the computation is executed
$O(n)$ times. Since each iteration of the inner loop has a computational
cost of $O(n^2)$, phase A has a computational complexity of $O(mn^3)$. The
computational cost of phase B is primarily a function of the p different
values of time t at which $Y(x,t)$ is to be evaluated and the m terms in the
Fourier series approximation. Summing the partial fraction coefficients costs
$O(n)$ for each value of u and summing the Fourier series costs $O(m)$ for each
value of x, so phase B has a computational complexity of
$O(pmn(n+q))$. Therefore, phase A comprises the principal computational
burden of the algorithm. The total computational effort to obtain $Y(x,t)$ for
q values of x at each of p values of t is $O(pmn(n+q) + mn^3)$. The practical
implication is that once the computationally expensive phase A has been
used to determine $\tilde{Y}(u,t)$, evaluating $Y(x,t)$ at other (x,t) points can be
done very cheaply.

Often the constants that are brushed under the rug by the O() notation
are important. The computational cost of the algorithm is approximately
$16m(1+\alpha)n^3$ where $\alpha$ is a difficulty factor for the QR algorithm that depends
on the spectrum of $(Q - uR)$. Since for most matrices $1 \le \alpha \le 2$, the
computational cost should be between $32mn^3$ and $48mn^3$. We present in
Table 2 the operation counts (flops) and approximate computation times for
determining $Y_i(x,t)$ on a CONVEX C-1 XP. The operation count values are
the median of a small sample, and the time values are the maximum of the
same small sample. The order estimates of the previous paragraph indicate
the importance of n to the asymptotic behavior of the computation time.
Consequently, we fix the number of terms in the Fourier series expansion
m at 80 and the number of time values p at 1. The number of values of x
(amounts of accumulated reward) q is fixed at 100. These are typical values
we used to examine $Y_i(x,t)$ for the examples in this paper.
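
The cost estimate is easy to tabulate for the settings just described. The sketch below evaluates the bounds $32mn^3$ and $48mn^3$ next to the phase B order term $pmn(n+q)$ (with its constant factor ignored) for m = 80, p = 1, q = 100; the function names and the list of state-space sizes are illustrative assumptions.

```python
def flop_estimate(n, m, alpha):
    """Approximate cost 16 * m * (1 + alpha) * n^3 of the algorithm."""
    return 16 * m * (1 + alpha) * n ** 3

def phase_b_term(n, m, p, q):
    """Order term p * m * n * (n + q) of phase B, constant factor ignored."""
    return p * m * n * (n + q)

m, p, q = 80, 1, 100
for n in (4, 10, 40, 170, 365):
    low = flop_estimate(n, m, 1)     # alpha = 1
    high = flop_estimate(n, m, 2)    # alpha = 2
    print(n, low, high, phase_b_term(n, m, p, q))
```

For the smallest n the phase B term is a noticeable fraction of the total, while for the larger sizes the cubic term dominates.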


n        4           10          40          170          365

flops    3.2x10^5    4.0x10^6    1.8x10^8    1.1x10^10    1.0x10^11

time     6 sec.      15 sec.     320 sec.    3.1 hr.      25 hr.

Table 2. Computation Time and Flops with Different Values of n

The effect of the $pmn(n + q)$ term is unimportant when $p < n$ and $pq < n^2$.
Therefore, the increase in flops becomes approximately $n^3$ for values of
n > 6. The computation time for the larger state problems is approximate
because the jobs were run at a low priority and the CONVEX C-1 was not
dedicated to solving these problems.